Abstract. Constant entropy rate (conditional entropies must remain constant as the
sequence length increases) and uniform information density (conditional probabilities
must remain constant as the sequence length increases) are two information theoretic
principles that are argued to underlie a wide range of linguistic phenomena. Here we
revise the predictions of these principles in the light of Hilberg's law on the scaling
of conditional entropy in language and related laws. We show that constant entropy
rate (CER) and two interpretations for uniform information density (UID), full UID
and strong UID, are inconsistent with these laws. Strong UID implies CER but the
reverse is not true. Full UID, a particular case of UID, leads to costly uncorrelated
sequences that are totally unrealistic. We conclude that CER and its particular cases
are incomplete hypotheses about the scaling of conditional entropies.

Keywords: constant entropy rate, uniform information density, Hilberg's law.
1. Introduction

Uniform information density and constant entropy rate [1, 2] are two information
theoretic principles that have been put forward to explain various linguistic phenomena,
e.g., syntactic reduction [1, 3] and the frequency of word orders [4]. In order to present
these principles, we provide some definitions. Formally, $u$ is a linguistic sequence of $n$
elements, e.g., words or letters, i.e. $u = x_1, ..., x_n$, where $x_i$ is the $i$-th element of the
sequence. $U_i$ is defined as the set of elements that can appear in the $i$-th position of the
sequence and $U$ as the set of all possible sequences of length $n$, i.e.
$U \subseteq U_1 \times ... \times U_n$.
Both $U$ and all the $U_i$'s are support sets, i.e. all their members have non-zero probability
and all non-members have zero probability. $X_i$ is the random variable that takes values
from the set of elements $U_i$. Hereafter, $x_i$ stands for an element of $U_i$ or a value of $X_i$
by default.

The constant entropy rate (CER) hypothesis states that the conditional entropy
of an element given the previous elements remains constant [2]. To express it formally,
$H(X_i|X_1, ..., X_{i-1})$ is defined as the Shannon conditional entropy of the $i$-th element of
a sequence given the $i - 1$ preceding elements. The CER states that $H(X_i|X_1, ..., X_{i-1})$
remains constant as $i$ increases ($i = 1, 2, ..., n$), i.e.

$$H(X_1) = H(X_2|X_1) = ... = H(X_n|X_1, ..., X_{n-1}). \tag{1}$$
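The CER condition of Eq. [1] can be checked numerically on a small joint distribution. The following is a minimal sketch, assuming the distribution is given as a dictionary from sequences to probabilities; the function names `conditional_entropies` and `satisfies_cer` and the two toy distributions are illustrative assumptions, not part of the original text.

```python
from collections import defaultdict
from math import isclose, log2


def conditional_entropies(joint):
    """Return [H(X_1), H(X_2|X_1), ..., H(X_n|X_1..X_{n-1})] in bits.

    `joint` maps each sequence (a tuple of length n) to its probability.
    """
    n = len(next(iter(joint)))
    # Block entropies H(X_1..X_i) for i = 0..n, computed from prefix marginals.
    block = [0.0]
    for i in range(1, n + 1):
        prefix_prob = defaultdict(float)
        for seq, p in joint.items():
            prefix_prob[seq[:i]] += p
        block.append(-sum(p * log2(p) for p in prefix_prob.values() if p > 0))
    # Chain rule: H(X_i | X_1..X_{i-1}) = H(X_1..X_i) - H(X_1..X_{i-1}).
    return [block[i] - block[i - 1] for i in range(1, n + 1)]


def satisfies_cer(joint, tol=1e-9):
    """True if all conditional entropies in Eq. [1] coincide (up to tol)."""
    h = conditional_entropies(joint)
    return all(isclose(x, h[0], abs_tol=tol) for x in h)


# Toy distribution 1 (assumed): two independent fair binary symbols.
fair = {(a, b): 0.25 for a in "01" for b in "01"}
# Toy distribution 2 (assumed): the second symbol copies the first.
copy = {("0", "0"): 0.5, ("1", "1"): 0.5}

print(conditional_entropies(fair), satisfies_cer(fair))
print(conditional_entropies(copy), satisfies_cer(copy))
```

In the second toy distribution the second symbol is fully determined by the first, so its conditional entropy drops to zero and CER fails.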

The uniform information density (UID) hypothesis [1, 3] states that the conditional
probability of an element given the previous elements should remain constant. To
express it formally, $p(x_i|x_1, ..., x_{i-1})$ is defined as the probability of $x_i$ given the preceding
elements in $u$. We say that a particular utterance $u = x_1, ..., x_n$ satisfies the UID
condition if and only if

$$p(x_1) = p(x_2|x_1) = ... = p(x_n|x_1, ..., x_{n-1}). \tag{2}$$

Here we will review the validity of CER and UID in the light of the real scaling
of the conditional and other entropies as $n$ increases, or equivalently, as $i$, the length
of the prefix, increases [5, 6, 7, 8, 9]. We will show that the scaling of these entropies
is inconsistent with CER and two interpretations of UID, strong UID and full UID, a
particular case of strong UID. In essence, our arguments are the following. First, CER
is inconsistent with Hilberg's law, a hypothesis on the entropy of natural language
made on the basis of Shannon's famous experiment [5]. Hilberg's law states that
$H(X_n|X_1, ..., X_{n-1}) \sim n^{-\beta}$, with $\beta \approx 1/2$ [6], whereas CER means $\beta = 0$. Second,
strong UID, i.e., all the utterances in $U$ satisfy the UID condition, is a particular
case of CER. The latter can be easily shown. By taking logarithms and multiplying
by $p(u) = p(x_1, ..., x_n)$ on the UID condition (Eq. [2]), the strong UID can be written
equivalently as

$$p(u) \log p(x_1) = p(u) \log p(x_2|x_1) = ... = p(u) \log p(x_n|x_1, ..., x_{n-1})$$

for any $u = x_1, ..., x_n \in U$. Summing over all $u \in U$ (the utterances that do not belong
to $U$ have no effect thanks to the convention $0 \log 0 = 0$) and inverting the sign, the
strong UID leads to

$$-\sum_u p(u) \log p(x_1) = -\sum_u p(u) \log p(x_2|x_1) = ... = -\sum_u p(u) \log p(x_n|x_1, ..., x_{n-1}),$$

where each part corresponds to the definition of a Shannon (conditional) entropy, i.e. 
the definition of CER in Eq. [1] is recovered. 

These ideas are developed in the coming sections, which will examine not only the
meaning of strong UID but also that of full UID, a particularly degenerate version of
real language where sequences of symbols are uncorrelated and entropies are maximum.

2. The uniform information density hypothesis 

First, let us inspect some consequences of the UID hypothesis. One of the major 
challenges of the uniform density hypothesis is defining a criterion for the applicability 
of the hypothesis. UID was originally defined on a single sequence [1]. The utility and 
power of the hypothesis depends on its scope: the more sequences UID concerns, the 
better. We start with a very ambitious UID hypothesis, namely that UID holds for any
sequence in $U$ that can be formed combining the elements of $U_i$, i.e.

$$U = U_1 \times ... \times U_n. \tag{3}$$

We call it full UID. We also consider a weaker but still strong version, where UID
holds also for any sequence in $U$ but Eq. [3] does not need to be satisfied. This version
is called strong UID. In fact, full UID implies strong UID but the reverse is not
true. To see the latter, consider the support set $U = \{(a,b), (a,c), (d,e), (d,f)\}$, where
$p(X_2 = x_2|X_1 = x_1) = p(X_1 = x_1) = 1/2$ for any $(x_1, x_2) \in U$. We thus have strong
UID but the full UID fails to hold since $U \neq U_1 \times U_2 = \{a, d\} \times \{b, c, e, f\}$.
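The distinction between strong and full UID can be verified mechanically on the support set $U = \{(a,b), (a,c), (d,e), (d,f)\}$. The sketch below assumes that each of the four sequences has probability 1/4 (which follows from the conditional probabilities stated above) and that a distribution is represented as a dictionary whose keys are exactly the support set; the helper names `utterance_satisfies_uid`, `strong_uid` and `full_uid` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def prefix_probs(joint, i):
    """Marginal probabilities of the prefixes of length i."""
    out = defaultdict(Fraction)
    for seq, p in joint.items():
        out[seq[:i]] += p
    return out


def utterance_satisfies_uid(joint, seq):
    """Check Eq. [2]: all conditional probabilities along `seq` are equal."""
    conds = []
    prev = Fraction(1)
    for i in range(1, len(seq) + 1):
        cur = prefix_probs(joint, i)[seq[:i]]
        conds.append(cur / prev)  # p(x_i | x_1, ..., x_{i-1})
        prev = cur
    return all(c == conds[0] for c in conds)


def strong_uid(joint):
    """UID holds for every sequence in the support set."""
    return all(utterance_satisfies_uid(joint, s) for s in joint)


def full_uid(joint):
    """Strong UID plus Eq. [3]: the support is the full Cartesian product."""
    n = len(next(iter(joint)))
    position_sets = [{s[i] for s in joint} for i in range(n)]
    is_full_product = set(joint) == set(product(*position_sets))
    return is_full_product and strong_uid(joint)


quarter = Fraction(1, 4)
example = {("a", "b"): quarter, ("a", "c"): quarter,
           ("d", "e"): quarter, ("d", "f"): quarter}

print(strong_uid(example), full_uid(example))
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the equality test in Eq. [2] is not affected by floating point rounding.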

The UID condition can be written in terms of the joint probability. By the chain
rule of conditional probability

$$p(u) = p(x_1) p(x_2|x_1) ... p(x_n|x_1, ..., x_{n-1}) \tag{4}$$

and Eq. [2] we obtain that the UID hypothesis implies

$$p(x_1, ..., x_n) = p(x_1)^n. \tag{5}$$

That is, strong UID means that all sequences beginning with the same word are equally
likely. Furthermore, noting that, by definition, we have

$$p(x_1) = \sum_{x_2, ..., x_n} p(x_1, ..., x_n) \tag{6}$$

and applying Eq. [5] we obtain

$$p(x_1) = p(x_1)^n \sum_{x_2, ..., x_n : (x_1, ..., x_n) \in U} 1 = p(x_1)^n |U(X_1 = x_1)|,$$
where $U(X_1 = x_1)$ is the subset of $U$ containing all the sequences where $X_1 = x_1$.
Therefore,

$$p(x_1) = |U(X_1 = x_1)|^{\frac{1}{1-n}}. \tag{7}$$

Taking in particular $n = 2$, we obtain that $p(x_1)$ must be a reciprocal of a natural
number.
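Eqs. [5] and [7] lend themselves to a quick numerical check. The following sketch assumes a joint distribution stored as a dictionary keyed by the sequences of the support set, with $n \geq 2$; the function name `check_eq5_eq7` and the three toy distributions are assumptions made for illustration.

```python
from collections import Counter
from math import isclose


def check_eq5_eq7(joint):
    """Return (eq5_holds, eq7_holds) for a joint distribution with n >= 2."""
    n = len(next(iter(joint)))
    p_first = Counter()      # p(x_1), obtained by marginalization (Eq. [6])
    count_first = Counter()  # |U(X_1 = x_1)|
    for seq, p in joint.items():
        p_first[seq[0]] += p
        count_first[seq[0]] += 1
    # Eq. [5]: p(x_1, ..., x_n) = p(x_1)^n for every sequence in U.
    eq5 = all(isclose(p, p_first[seq[0]] ** n) for seq, p in joint.items())
    # Eq. [7]: p(x_1) = |U(X_1 = x_1)|^(1/(1-n)).
    eq7 = all(isclose(p_first[x], count_first[x] ** (1 / (1 - n)))
              for x in p_first)
    return eq5, eq7


# Strong UID with n = 2 (four sequences of probability 1/4 each).
strong = {("a", "b"): 0.25, ("a", "c"): 0.25, ("d", "e"): 0.25, ("d", "f"): 0.25}
# Uniform distribution over {a, b}^3, i.e. n = 3.
cube = {(x, y, z): 0.125 for x in "ab" for y in "ab" for z in "ab"}
# Independent but non-uniform symbols: UID is violated.
skewed = {(x, y): px * py
          for x, px in (("a", 0.4), ("b", 0.6))
          for y, py in (("c", 0.4), ("d", 0.6))}

print(check_eq5_eq7(strong), check_eq5_eq7(cube), check_eq5_eq7(skewed))
```

For the `cube` example, $|U(X_1 = x_1)| = 4$ and $n = 3$, so Eq. [7] gives $4^{-1/2} = 1/2$, which matches the marginal probability of the first symbol.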

2.1. Full UID 

With the help of the properties of the UID hypothesis above it is easy to show that full
UID implies that, for any sequence $x_1, ..., x_n$,

(i) $x_1, ..., x_n$ are independent, i.e.,

$$p(x_1, ..., x_n) = \prod_{i=1}^{n} p(x_i).$$

(ii) The sets of elements that can appear at each position of the sequence have the same
cardinality, i.e. $|U_1| = ... = |U_n|$.

(iii) All words occurring in the same position are equally likely, i.e. $p(x_i) = 1/|U_i|$.

If UID holds for any sequence beginning with $x_1$ that can be formed by combining
elements from $U_2, ..., U_n$, then $|U(X_1 = x_1)| = \prod_{i=2}^{n} |U_i|$ and Eq. [7] becomes

$$p(x_1) = \left[ \prod_{i=2}^{n} |U_i| \right]^{\frac{1}{1-n}} \tag{8}$$

for $n \geq 2$. Eq. [8] indicates that $p(x_1)$ is the same for any $x_1 \in U_1$. Thus, the condition

$$\sum_{x_1 \in U_1} p(x_1) = 1$$

gives $p(x_1) = 1/|U_1|$ and the UID condition in Eq. [5] becomes

$$p(x_1, ..., x_n) = |U_1|^{-n}. \tag{9}$$

Now we will derive $p(x_i)$ for $i \geq 2$. Employing Eq. [9], Eq. [6] can be written as (assuming
$i \geq 2$)

$$p(x_i) = \frac{1}{|U_1|^{n-1}} \prod_{j=2}^{i-1} |U_j| \prod_{j=i+1}^{n} |U_j|.$$

Notice that, again, $p(x_i)$ is the same for any $x_i \in U_i$. Thus, the condition

$$\sum_{x_i \in U_i} p(x_i) = 1$$

gives $p(x_i) = 1/|U_i|$ for any $i = 1, ..., n$. Now, we will show that $|U_i| = |U_1|$ for any
$i = 1, ..., n$. By definition, we have

$$p(x_1, ..., x_{i-1}) = \sum_{x_i \in U_i} p(x_1, ..., x_i),$$
which, thanks to UID (recall Eq. [9]), becomes

$$p(x_1, ..., x_{i-1}) = |U_1|^{-i} \sum_{x_i \in U_i} 1 = |U_1|^{-i} |U_i|.$$

Applying UID again (Eq. [9]) to the l.h.s. gives $|U_1|^{1-i} = |U_1|^{-i} |U_i|$, and thus $|U_i| = |U_1|$ and
$p(x_i) = 1/|U_1|$ for any $i = 1, ..., n$. Now it is easy to show that $x_1, ..., x_n$ are independent.
The definition of independence, i.e. $p(x_1, ..., x_n) = \prod_{i=1}^{n} p(x_i)$, holds trivially under UID
since then $p(x_1, ..., x_n) = |U_1|^{-n}$ and $p(x_i) = 1/|U_1|$ for any $i = 1, ..., n$. Finally, notice
that full UID means that $p(X_i = x_i) = 1/|U_1|$ but does not imply that the elements
of the sequences (regardless of their position) are equally likely. If all the sequences
of $U = \{(a,a), (a,c), (b,a), (b,c)\}$ have probability 1/4, one then has full UID but the
probability of producing $a$ is 4/8 while the probability of producing $c$ is 2/8.
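The closing example, with the four sequences (a,a), (a,c), (b,a) and (b,c) of probability 1/4 each, can be reproduced with a few lines of code. This is a small sketch in which the "probability of producing" an element is assumed to mean its expected share among all the positions of a sequence; the variable names `position` and `pooled` are illustrative.

```python
from collections import Counter
from fractions import Fraction

quarter = Fraction(1, 4)
joint = {("a", "a"): quarter, ("a", "c"): quarter,
         ("b", "a"): quarter, ("b", "c"): quarter}
n = 2

position = [Counter() for _ in range(n)]  # p(X_i = x) for each position i
pooled = Counter()                        # probability regardless of position
for seq, p in joint.items():
    for i, x in enumerate(seq):
        position[i][x] += p
        pooled[x] += p / n  # each position contributes 1/n of the elements

print([dict(c) for c in position])
print(dict(pooled))
```

Within each position the elements are equally likely, as full UID requires, whereas the pooled distribution is not uniform.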

2.2. The relationship between CER and strong UID 

The relationship between CER and UID also needs to be unravelled. For instance, [4] is
based on the idea of UID but its mathematical implementation is based on the definition
of CER. Strong UID implies CER (Section [1]) but it will be shown that the reverse
implication does not hold. Consider CER with $n = 2$, i.e. $H(X_2|X_1) = H(X_1)$, and
independence between $X_1$ and $X_2$, i.e. $H(X_2|X_1) = H(X_2)$. Thus, $H(X_1) = H(X_2)$.
Assume also that $U_1 = \{a, b\}$ and $U_2 = \{c, d\}$. If one has $p(X_1 = a) = p(X_2 = c) = 2/5$,
$p(X_1 = b) = p(X_2 = d) = 3/5$, then one has CER but strong UID does not hold
because 2/5 is not the reciprocal of a natural number, a condition for strong UID
noticed beforehand.
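The counterexample with probabilities 2/5 and 3/5 can be checked numerically. The sketch below builds the joint distribution of two independent symbols and tests CER and the UID condition of Eq. [2] separately; the names `entropy`, `cer_holds` and `strong_uid_holds` are illustrative assumptions.

```python
from math import isclose, log2

p1 = {"a": 0.4, "b": 0.6}  # distribution of X_1
p2 = {"c": 0.4, "d": 0.6}  # distribution of X_2
# Independence: p(x_1, x_2) = p(x_1) p(x_2).
joint = {(x, y): px * py for x, px in p1.items() for y, py in p2.items()}


def entropy(dist):
    """Shannon entropy in bits of a dictionary of probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


h1 = entropy(p1)
h2_given_1 = entropy(joint) - h1  # chain rule of joint entropy
cer_holds = isclose(h1, h2_given_1, abs_tol=1e-9)

# UID condition for n = 2: p(x_1) = p(x_2 | x_1) for every sequence in U.
uid_per_sequence = {
    seq: isclose(p1[seq[0]], p / p1[seq[0]]) for seq, p in joint.items()
}
strong_uid_holds = all(uid_per_sequence.values())

print(cer_holds, strong_uid_holds, uid_per_sequence)
```

Only the sequences (a,c) and (b,d) satisfy the UID condition individually, so strong UID fails although both conditional entropies coincide.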

3. The real scaling of entropies versus constant entropy rate 

A serious consequence of the properties of full UID is that it is totally unrealistic with 
respect to natural language for several reasons. First, full UID leads to a sequence of 
independent elements, while long range correlations pervade linguistic sequences both 
at the level of letters and the level of words, e.g. [IH [3 [121 [131 [H] . Second, full UID is 
problematic because entropies are maximum. $H(X_i)$ is maximum for any $i = 1, ..., n$ as
$H(X_i) = \log |U_i|$. The joint entropy is also maximum because [10]

$$H(X_1, ..., X_n) \leq \sum_{i=1}^{n} H(X_i) \tag{10}$$

in general but full UID transforms the inequality of Eq. [10] into a mere equality because
the elements making a sequence are independent. Since entropy is a measure of cognitive
cost [15, 16], full UID means the entropy related costs are maximum.

Last but not least, the plausibility of UID or CER for natural language is
undermined by the results of celebrated experiments. Let $H_\varepsilon(X_n|X_1, ..., X_{n-1})$ be an
estimate of $H(X_n|X_1, ..., X_{n-1})$ from real data and $\varepsilon_n$ the error of the estimate, i.e.,

$$H_\varepsilon(X_n|X_1, ..., X_{n-1}) = H(X_n|X_1, ..., X_{n-1}) + \varepsilon_n,$$
where $\varepsilon_n > 0$ in general (by the nonnegativity of Kullback-Leibler divergence and Kraft
inequality, it follows that the errors $\varepsilon_n$ are positive if entropy is estimated by means
of universal probability or universal coding [10]). Hilberg [6] reanalyzed Shannon's
estimates of conditional entropy for English [5] and discovered that

$$H_\varepsilon(X_n|X_1, ..., X_{n-1}) \approx C_\varepsilon n^{\alpha - 1} \tag{11}$$

with $C_\varepsilon > 0$, $\alpha \approx 0.5$, and $n \leq 100$ characters. Extrapolating Hilberg's law, Eq. [11], for
$n \gg 100$ requires some caution. If accepted with $\varepsilon_n = 0$, Eq. [11] would imply asymptotic
determinism of human utterances with an entropy rate $h = \lim_{n \to \infty} H(X_n|X_1, ..., X_{n-1})$
equal to 0. Thus it is more plausible to accept that the scaling law of the true entropy
(not its estimate) is

$$H(X_n|X_1, ..., X_{n-1}) \approx C n^{\alpha - 1} + h, \tag{12}$$

with $C < C_\varepsilon$ and a sufficiently small constant $h > 0$. That the modified Hilberg law,
Eq. [12], is indeed valid for natural language for $n \gg 100$ characters can be corroborated
by the following fact: the modified Hilberg's law implies a lower bound for the growth
of $V$, the observed vocabulary size, as a function of $T$, the text length. Namely, Eq.
[12] implies that $V$ grows at least as $\sim T^{\alpha}/\log T$ [17], which is in good accordance with
the real growth of $V$ [18].

In contrast, CER is a competing hypothesis on $H(X_n|X_1, ..., X_{n-1})$, the true
conditional entropy. Assuming that CER, Eq. [1], holds in spite of Hilberg's law, Eq.
[11], is equivalent to stating that the errors $\varepsilon_n \approx C_\varepsilon n^{\alpha - 1} - H(X_1)$ are systematically
decreasing as $n$ increases and negative for moderate $n$ if $H(X_1)$ is large enough. This
seems unrealistic since, as we have stated above, the errors of entropy estimates should
be positive in general and it is unlikely that the errors are systematically diminishing
(undersampling, a very important source of error, usually increases as $n$ increases).
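The behaviour of the errors implied by CER can be illustrated with a short computation of $\varepsilon_n \approx C_\varepsilon n^{\alpha - 1} - H(X_1)$. In the sketch below, the values of `C_EPS` and `H1` are arbitrary illustrative assumptions (they are not fitted to Shannon's data), and `ALPHA` is set to the value 0.5 mentioned above.

```python
def implied_error(n, c_eps, alpha, h1):
    """Error of the entropy estimate if CER and Hilberg's law both held."""
    return c_eps * n ** (alpha - 1) - h1


# Illustrative parameter values only; they are assumptions, not estimates.
C_EPS, ALPHA, H1 = 4.0, 0.5, 1.0

errors = [implied_error(n, C_EPS, ALPHA, H1) for n in range(1, 101)]
# Smallest n at which the implied error becomes negative.
first_negative_n = next(n for n, e in enumerate(errors, start=1) if e < 0)

print(errors[0], errors[-1], first_negative_n)
```

With these assumed values the implied error decreases monotonically and changes sign well before $n = 100$, which is the pattern described in the text as unrealistic.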

The disagreement between real language and CER concerning the decay of
conditional entropy can be rephrased in terms of other entropic measures: $H(X_1, ..., X_n)$,
the joint or block entropy [8, 9], and $H(X_1, ..., X_n)/n$, the joint entropy per unit. It
is easy to infer the scaling of these entropies from the modified Hilberg law, Eq. [12], by
means of the chain rule of the joint entropy [10], which yields

$$H(X_1, ..., X_n) = \sum_{i=1}^{n} H(X_i|X_1, ..., X_{i-1}) \approx \int_0^n \left[ C m^{\alpha - 1} + h \right] dm = \alpha^{-1} C n^{\alpha} + hn, \tag{13}$$

and thus

$$H(X_1, ..., X_n)/n \approx \alpha^{-1} C n^{\alpha - 1} + h. \tag{14}$$

In contrast, CER predicts a linear growth of the joint entropy, i.e.

$$H(X_1, ..., X_n)/n = H(X_1), \tag{15}$$

which can be proven by applying the definition of CER in Eq. [1] to the chain rule of
joint entropy [10].
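The approximation of the chain rule sum by an integral in Eq. [13] can be examined numerically. The sketch below compares the exact sum of the modified Hilberg law, Eq. [12], with the closed form $\alpha^{-1} C n^{\alpha} + hn$; the parameter values `C`, `ALPHA` and `H` are arbitrary assumptions chosen for illustration, and the function names are hypothetical.

```python
def block_entropy_sum(n, c, alpha, h):
    """Chain rule: sum of the conditional entropies C*m^(alpha-1) + h."""
    return sum(c * m ** (alpha - 1) + h for m in range(1, n + 1))


def block_entropy_closed_form(n, c, alpha, h):
    """Integral approximation of Eq. [13]: C*n^alpha/alpha + h*n."""
    return c * n ** alpha / alpha + h * n


# Illustrative parameter values only; they are assumptions, not estimates.
C, ALPHA, H = 1.0, 0.5, 0.1

lengths = (10, 100, 1000, 10000)
per_unit = []
ratios = []
for n in lengths:
    s = block_entropy_sum(n, C, ALPHA, H)
    f = block_entropy_closed_form(n, C, ALPHA, H)
    per_unit.append(s / n)  # joint entropy per unit, cf. Eq. [14]
    ratios.append(s / f)    # agreement between the sum and the integral
    print(n, s / n, f / n, s / f)
```

Under these assumptions the entropy per unit keeps decreasing towards $h$ as $n$ grows, whereas CER, Eq. [15], would keep it fixed at $H(X_1)$ for every $n$.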

As an independent confirmation of Shannon's research, according to Ref. [7], the
estimates of $H(X_1, ..., X_n)/n$ for sequences of letters from an English novel show good
agreement with Eq. [14] with $\alpha = 0.5$. Furthermore, the estimates of $H(X_1, ..., X_n)$ grow
sublinearly with $n$ not only for texts in English but also for sequences from many other
languages with different kinds of units [9, 8]. CER predicts that $H(X_1, ..., X_n)/n$ is
constant, Eq. [15], a strikingly different result. Therefore, the inconsistencies of CER are
robust in the sense that they do not depend on the entropic measure, the language or the units
of the sequence being considered. The same inconsistencies concern full and strong UID
as it has been shown in Sections [1] and [2] that Full UID $\Rightarrow$ Strong UID $\Rightarrow$ CER.

4. Discussion 

We have shown that CER and the two interpretations of the UID hypothesis (full UID 
and strong UID) are inconsistent with the scaling law for entropy of natural language 
called Hilberg's law. Future research should address the challenge of what modifications 
of UID/CER can be consistent with real language. In order to save UID/CER, we 
envisage that probabilities and conditional entropies for real language stem from a 
conflict between principles: one acting towards UID/CER and another acting against 
UID/CER. A similar conflict between principles has been hypothesized for Zipf's law
for word frequencies, namely that the frequency of the $i$-th most frequent word of a text
decays as a power of $i$ [19]: the law emerges from a conflict between two principles, minimization of
the entropy of words and maximization of the mutual information between words and
meanings, where none of the principles is totally realized [20, 21]. Indeed, Refs. [17, 22]
show that there is a close relationship between Zipf's law and Hilberg's law so the
conflict of principles that leads to Zipf's law may be the same that prevents UID/CER
from being fully realized.

It is worth noting that there are simple information theoretic principles that lead
to UID/CER. For instance, minimization of conditional entropy leads to Eq. [1] with
$H(X_1) = 0$. Interestingly, entropy minimization can be easily justified as it implies
the minimization of cognitive cost [15, 16] and is used to explain Zipf's law for word
frequencies [20, 21]. Clearly, this conditional entropy minimization could not be acting
alone as its total realization implies that all possible sequences have probability zero
except one and therefore Hilberg's law could not hold. As the force towards UID/CER is
not working alone, the nature of the second factor in conflict must be clarified. The view
of UID/CER as a principle in conflict means that all the currently available explanations
of linguistic phenomena based upon UID/CER, e.g., [2, 1, 3], are a priori incomplete.